Method and apparatus for implementing high-performance, scaleable data processing and storage systems

ABSTRACT

A data system architecture is described that allows multiple processing and storage resources to be connected to multiple clients so as 1) to distribute the clients' workload efficiently across the available resources; and 2) to enable scaleable expansion, both in terms of the number of clients and in the number of resources. The major features of the architecture are separate, modular, client and resource elements that can be added independently, a high-performance cross-bar data switch interconnecting these various elements, separate serial communication paths for controlling the cross-bar switch settings, separate communication paths for passing control information among the various elements and a resource utilization methodology that enables clients to distribute processing or storage tasks across all available resources, thereby eliminating “hot spots” resulting from uneven utilization of those resources.

FIELD OF THE INVENTION

This invention relates to data processing systems and, in particular, to data processing systems involving the transfer, manipulation, storage and retrieval of large amounts of data.

BACKGROUND OF THE INVENTION

In data processing applications involving the transfer, manipulation, storage and retrieval of large amounts of data, the most serious performance limitations include (1) difficulties in moving data between users who need access to the data and resources used to store or process the data and (2) difficulties in efficiently distributing the workload across the available resources. These difficulties are particularly apparent, for example, in disk-based storage systems in which the greatest performance limitation is the amount of time needed to access information stored on the disks. As databases increase in size, requiring more and more disks to store that data, this problem grows correspondingly worse and, as the number of users desiring access to that data increases, the problem is compounded even further. Yet the trends toward both larger databases and an increased user population are overwhelmingly apparent, typified by the rapid expansion of the Internet.

Current techniques used to overcome these difficulties include reducing access time by connecting users to multiple resources over various types of high-speed communication channels (e.g., SCSI buses, Fibre Channel and InfiniBand buses) and using caching techniques in an attempt to reduce the necessity of accessing the resources. For example, in the case of storage systems, large random-access memories are often positioned locally to the users and are used as temporary, or cache, memories that store the most recently accessed data. These cache memories can be used to eliminate the need to access the storage resource itself when the cached data is subsequently requested and they thereby reduce the communication congestion.

Various distribution algorithms are also used to allocate tasks among those resources in attempts to overcome the workload distribution problem. In all cases, however, data is statically assigned to specific subsets of the available resources. Thus, when a resource subset temporarily becomes overloaded by multiple clients simultaneously attempting to access a relatively small portion of the entire system, performance is substantially reduced. Moreover, as the number of clients and the workload increase, the performance rapidly degrades even further since such systems have limited scalability.

SUMMARY OF THE INVENTION

In accordance with one illustrative embodiment of the invention, users are connected to access interfaces. In turn, the access interfaces are connected to a pool of resources by a switch fabric. The access interfaces communicate with each client using the client's protocol and then interface with the resources in the resource pool to select the subset of the resource pool to use for any given transaction and to distribute the workload. The access interfaces make it appear to each client that the entire set of resources is available to it without requiring the client to be aware that the pool consists of multiple resources.

In accordance with one embodiment, a disk-based storage system is implemented by client interfaces referred to as host modules and processing and storage resources referred to as metadata and disk interface modules, respectively.

The invention eliminates the prior art problems by enabling both client interfaces and processing and storage resources to be added independently as needed, by providing much more versatile communication paths between clients and resources and by allowing the workload to be allocated dynamically, with data constantly being directed to those resources that are currently least active.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which:

FIG. 1 is a block schematic diagram of a resource access system constructed in accordance with the principles of the present invention.

FIG. 2 is a block schematic diagram of an illustrative storage system embodiment implemented with the architecture of FIG. 1.

FIG. 3 is a detailed block schematic diagram of a host interface module.

FIGS. 4A-4C, when placed together, form a flowchart illustrating the steps in a process carried out by the host interface module in response to a request from a client.

FIG. 5 is a detailed block diagram of a disk interface module.

FIG. 6 is a flowchart illustrating the processing steps performed by software running in the disk interface module.

FIG. 7 is a detailed block schematic diagram of a metadata module.

FIGS. 8A and 8B, when placed together, form a flowchart illustrating processing steps performed by software running in the metadata module.

FIG. 9 is a detailed block schematic diagram of a switch module.

DETAILED DESCRIPTION

A block schematic diagram of a resource access system 100 in accordance with an embodiment of the invention is shown in FIG. 1. The system consists of three components. Access interfaces 106-112 provide clients, such as clients 102 and 104, with access to the system 100 and provide other access-related resources. A pool of resources 118-124 may comprise, for example, data processing or storage devices. A switch fabric 114, 116 interconnects the access interfaces 106-112 and the resources 118-124. Since the requirements for communicating control information differ significantly from those for data communication, the switch fabric consists of a control switch fabric 114 and a data switch fabric 116 in order to provide different paths and protocols for control and data. For example, control transfer protocols generally divide the control information into relatively small packets that are transferred using packet-switching technology. In contrast, data transfer protocols generally consist of larger packets conveyed over a circuit-switched fabric. The separation of the switch fabric into two sections 114 and 116 allows each type of communication path to be optimized for its specific function and enables service requests to be transferred to a resource, via the control switch fabric 114, without interfering with the data transferring capacity of the data switch fabric 116.

In accordance with the principles of the invention, the access interfaces 106-112 operate to virtualize the pool of resources 118-124, thereby making it appear to each client, such as clients 102 and 104, that the entire set of resources 118-124 is available to it without requiring the client to be aware of the fact that the pool is in fact partitioned into multiple resources 118-124.

This virtualization is accomplished by enabling the access interfaces 106-112 to serve as communication protocol terminators and giving them the ability to select the subset of the resource pool 118-124 to use for any given transaction. An access interface, such as interface 106, is thus able to communicate with a client, such as client 102, using the client's protocol for messages. The interface 106 parses a message received from the client into a portion representing data and a portion consisting of commands or requests for service. The interface 106 then interprets those requests and distributes the workload and the associated data across the pool of resources 118-124.

The distribution of the workload may entail the access interface accessing a number of resources and brings with it several major advantages. For example, it allows the workload to be distributed across the available resources, preventing the “hotspots” typically encountered when multiple clients independently attempt to access multiple resources. Since clients generally do not have knowledge about other clients' activities, it is very difficult, if not impossible, for the clients themselves to achieve any such level of load balancing on their own. In addition, it enables resources to be added non-disruptively to the resource pool. Clients need not be aware that additional resources have been made available since the access interfaces themselves are responsible for allocating resources to requests for service. This, in turn, allows the system capacity to be scaled to meet demand as that demand increases over time. Similarly, the ability of the access interfaces to distribute workloads allows the external connectivity to be increased to accommodate additional clients, again without disrupting on-going activities with existing clients.

The inventive system can be used to construct resource allocation systems for any type of resources. The remainder of this disclosure describes an embodiment which implements a disk-based storage system, but this embodiment should not be considered as limiting. In this embodiment, the access interfaces 106-112 are referred to as “host interface modules” and the resources are disk storage devices. The disk storage devices are connected to the switch fabric by “disk interface modules” and separate processing modules called “metadata” modules are also provided.

The storage system embodiment is shown in FIG. 2. The storage system 200 consists of a set of access modules 206-210 called host interface modules, and two types of resource modules: disk interface modules 218-222 and metadata modules 212-214. The host interface modules 206-210 provide one or more clients, of which clients 202 and 204 are shown, access to the system 200 and communicate with each client 202, 204 using the client's message passing protocol. The host interface modules 206-210 parse requests from the clients 202, 204 for disk and file system accesses and distribute the storage load across the entire set of disks connected to the system 200, of which disks 226-230 are shown. The host interface modules are responsible for the logical allocation of the storage resources.

The disk interface modules 218-222 each support up to 450 disks and are responsible for the physical allocation of their disk resources. The disk interface modules provide data buffering, parity generation and checking and respond to requests from the host interface modules for access to their associated disks.

The metadata modules 212-214 provide a processing resource that maintains the structure and consistency of file systems used in the system. They are used when the storage system serves as a standalone file system, for example, in a networked environment, and hence assumes the responsibility for maintaining the file systems. The data used to describe the objects in the file system, their logical locations, relationships, properties and structures, is called “metadata.” In applications in which the storage system is directly attached to a host that implements this function itself, metadata modules are not needed and are accordingly not included in configurations intended for such applications. Since these applications and other storage system applications (e.g., HTTP server, web cache protocol server, and FTP server applications) require a subset of the functionality needed for standalone file systems, the illustrated embodiment is configured as a standalone file system, but the invention is equally effective in direct-attach applications. The following description applies equally to systems configured for direct attachment.

The switch module 216 provides the command and data paths used to interconnect the other three module types and contains both the control and data switches. In this embodiment of the invention, module 216 is capable of passing a block of data, for example, two kilobytes, between arbitrary pairs of modules at approximately fixed time increments, for example, approximately every four microseconds. Each host interface, disk interface and metadata module operates in full duplex mode, enabling it to transmit and receive simultaneously at the aforementioned rate and thereby supporting a system-level data bandwidth of up to N gigabytes/second, with N the total number of host interface, disk interface and metadata modules in the system.

The previously listed advantages of this architecture take the following more concrete forms when applied to the storage system. First, host interface modules are allowed to send incoming data to any available disk interface module for storage regardless of where that data might have been previously stored. This ability, in turn, distributes read accesses across the full complement of disks, avoiding the inevitable hotspots encountered in conventional storage systems in which disks are partitioned into physical volumes and data must be directed to a specified volume.

Second, additional metadata modules, disk interface modules and physical disks can be added at any time. Clients need not be aware that additional resources have been made available since knowledge of where data is physically stored is not visible to them. This allows the logical space allocated to clients to far exceed the physical space that is currently available. Physical disk space does not need to be added until clients begin to use up a significant portion of the available physical space, which is typically much less than the allocated logical space.

Third, additional host interface modules can be added at any time to increase the connectivity available to the current clients or to add new clients. Since all host interface modules have equal access to all resources, traditional data segmentation and replication is not needed to provide access to an expanded set of clients. For the same reason, clients can transfer data to multiple disks in a single transfer; clients are, in fact, unconcerned about where that data is physically stored.

A more detailed diagram of a host interface module is shown in FIG. 3. Each host interface module 300 is composed of four major components: a central processing unit (CPU) complex 324, a data complex 318, an input/output (I/O) complex 302 for communicating with the host and a switch interface 352. The CPU complex 324, in turn, consists of a microprocessor CPU 332 with its associated level-one (internal) and level-two (external) caches 330, memory and I/O bus control logic 334, local random-access memory (RAM) 326 and content-addressable memory (CAM) 338. A peripheral bus 336 provides access to the CAM 338, the data complex 318, the switch interface 352 and, through an I/O buffer 328, to the I/O complex 302. A PCI bus 339 provides access over the data transfer bus 350 to the data complex 318 and to two full-duplex channel adapters 340, 342 which connect to two full-duplex 10/100 megabit Ethernet channels called the Interprocessor Communication channels (IPCs) 346, 348 used to communicate with other modules in the system.

The data complex 318 comprises a large (typically two-gigabyte), parity-protected data memory 322 supported with a memory controller 320 that generates the control signals needed to access the memory 322 over a 128-bit data bus 323 and interface logic and buffers providing links to the I/O complex 302, the switch interface 352 and, over the data transfer bus 350, to the CPU complex 324. The memory controller 320 responds to read requests from the other sections of the host interface module and forwards the requested information to them. Similarly, it accepts write requests from them and stores the associated data in the specified locations in memory 322.

The I/O complex 302 is used to communicate with the clients via ports 307-313. There are two versions of the I/O complex 302: one version supports four one-gigabit, full-duplex Ethernet ports and the second version supports four one-gigabit, full-duplex Fibre Channel ports. The second of these versions is typically used for systems directly attached to hosts; the first version is used for network-attached storage systems and is a preferred embodiment. Multiple protocols, including SCSI, TCP/IP, UDP/IP, Fibre Channel, FTP, HTTP, bootp, etc., are supported for communicating over these ports between clients and the host interface modules. These protocols are interpreted at the host interfaces 306-312. Commands (e.g., read or write a file, lookup a file or directory, etc.) are buffered in the local I/O memory 304 for access by the CPU software via bus 314. Data received from the clients, via ports 307-313, is sent to the data memory 322 where it is buffered pending further action. Similarly, data passed from the storage system 300 to clients is buffered in the data memory 322 while the I/O complex 302 generates the appropriate protocol signals needed to transfer that data to the client that requested it.

The switch interface 352 contains a buffer memory 354 and associated logic to accept, upon command from the CPU software over the peripheral bus 336, commands to transfer data, via bus 357, from the data complex 318 to external modules. It buffers those commands and submits requests to the switch (216, FIG. 2) for access to the destinations specified in those commands. When a request is granted, the switch output logic 356 commands the memory controller 320 to read the specified locations in memory 322 and transfer the data to it to be forwarded to the intended destination. Similarly, the switch input logic 358 accepts data from the switch at the full switch bandwidth and forwards it, along with the accompanying address, to the data complex 318 via bus 364. Data is transferred from the output logic 356 to the switch and from the switch to the input logic 358 using, in each case, four serial, one-gigabit/second connections 360, 362, giving the host interface module the ability to transmit, and simultaneously to accept, data at a rate of 500 megabytes/second. Similarly, the request and grant paths to the switch are also both implemented with serial one-gigabit/second links.

When a request is received from a client over one of the Ethernet or Fibre Channels 307-313, the I/O complex 302 generates the appropriate communication protocol responses and parses the received packet of information, directing the request to buffer 304 to await processing by the CPU software and any associated data to buffer 304 for subsequent transfer to the data memory 322. The processing steps taken by the software running on the host interface module CPU 332 are illustrated in the flowchart shown in FIGS. 4A-4C.

In FIG. 4A, the process starts in step 400 and proceeds to step 402, where the host interface receives a request from the client. The request always contains a file or directory “handle” that has been assigned by internal file system processing to each data object. This file system processing is typically done in the metadata module. The handle identifies that object and is sent to the client to be used when the client is making future references to the object. Associated with each such handle is an “inode”, which is a conventional data structure that contains the object “attributes” (i.e., the object size and type, an identification of the users entitled to access it, the time of its most recent modification, etc.) of the file or directory. Each inode also contains either a conventional map, called the “fmap”, or a handle, called the “fmap handle”, that can be used to locate the fmap. The fmap identifies the physical locations, called the global physical disk addresses (GPDAs), of the component parts of the object, indexed by their offsets from the starting address of that object. In step 404, upon reading the request from the request buffer 304, the CPU software extracts the object handle from the request.
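
For purposes of illustration only, the relationships among handles, inodes, fmaps and GPDAs described above can be summarized in the following Python sketch. The class and field names (Inode, Fmap, fmap_handle and so on) are illustrative assumptions rather than the literal data layouts of the embodiment, and GPDA values are treated here as opaque keys, their internal structure being described later in connection with FIG. 4B.

    from dataclasses import dataclass, field
    from typing import Dict, Optional, Tuple

    # An opaque global physical disk address; its SPGN/zone/offset structure
    # is described later in this specification.
    GPDA = int

    @dataclass
    class Fmap:
        """Maps an object's page offsets to the GPDAs where those pages live."""
        pages: Dict[int, GPDA] = field(default_factory=dict)

    @dataclass
    class Inode:
        """Object attributes plus either an embedded fmap or an fmap handle."""
        size: int
        obj_type: str                       # e.g. "file" or "directory"
        allowed_users: Tuple[str, ...]      # users entitled to access the object
        mtime: float                        # time of most recent modification
        fmap: Optional[Fmap] = None         # embedded when small enough
        fmap_handle: Optional[GPDA] = None  # otherwise, where to fetch the fmap

    # A client-visible handle resolves (typically at a metadata module) to an inode.
    handles: Dict[int, Inode] = {}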

Next, in step 406, the CPU software queries the local CAM memory 338, using the extracted object handle as a key, to determine if the desired inode information is already stored in host interface local memory 326. If the inode information is present in the memory 326, the CAM memory access results in a “hit” and the CAM memory 338 returns the address in local memory 326 where the inode information can be found. In step 408, this address is then used to fetch the inode information from the local memory 326.

If the inode information is not present in the local memory, as indicated by a CAM memory “miss”, then, in step 410, the software uses the IPC links 346 and 348 to contact the appropriate metadata module (which is identified by the object handle) for the needed information, which is returned to it also over the IPC links. Once the software has located the inode (or a critical subset of the contents of the inode), it verifies that the requested action is permitted (step 412). If the action is not permitted, an error is returned in step 414 and the process ends in step 415.

Alternatively, if the requested action is permitted, then, in step 416, the CPU software determines which response is required. If stored data is to be read, the process proceeds, via off-page connectors 419 and 421, to step 418 shown in FIG. 4C, where the CPU software determines whether the fmap or the fmap handle required to honor that request is in the inode. If the fmap itself is too large to be contained in the inode, the process proceeds to step 420 where the software again consults its CAM 338, this time using the fmap handle as the key, to determine if the fmap pointed to by that handle is stored locally. If it is not, the process proceeds to step 422, where the software extracts the GPDA for the fmap from the fmap handle and sends a request for the fmap, or for the next level of fmap pointers, over the IPC links 346, 348 to the disk interface module identified by the GPDA, which returns the fmap, or the page containing the next level of fmap pointers, through the switch module and switch interface 352 to the host interface module data memory 322. The software can then access this information over the data transfer bus 350. In step 424, the software checks the information in the local data memory 322 to determine whether it has obtained the fmap. If not, the process returns to step 420 and continues this process until it finds the fmap and the GPDA of the data itself, during each iteration of the process checking its CAM 338 at step 420 to determine if the desired information is cached in the data memory 322.

When the software locates the GPDA of the desired data, either through the process set forth in steps 420-424 or because it was determined in step 418 that the fmap was in the inode, then, in step 426, the software again checks the CAM 338 to determine if the data resides in the host interface data memory 322. If the data is not in the data memory 322, in step 428, the software sends a read request to the disk interface module to retrieve the data identified by the GPDA. Once the data is in the host interface data memory 322, in step 430 the software sends a response over the peripheral bus 336 via the I/O buffer 328 to the I/O complex 302 indicating the location in the data memory 322 where the desired information resides, thereby enabling the I/O complex 302 to complete the transaction by returning the requested information to the client. A least-recently-used (LRU) replacement algorithm is used to manage the local memory 326 and data memory 322 caches. The process then ends in step 432.
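
The read path of FIGS. 4A and 4C may be summarized, again for illustration only, by the following sketch, which builds on the Inode and Fmap classes sketched above. The dictionaries local_cache, metadata_module and disk_modules are hypothetical stand-ins for the CAM-indexed local memories, the IPC inode lookup and the switch transfers from disk interface modules; the permission check of step 412 and multi-level fmap walks are omitted for brevity.

    def handle_read(handle, offset, local_cache, metadata_module, disk_modules):
        """Sketch of the host interface module read path (FIGS. 4A and 4C)."""
        # Steps 406-410: find the inode locally or ask the metadata module over IPC.
        inode = local_cache.get(handle) or metadata_module[handle]

        # Steps 418-424: use the embedded fmap, or fetch it via the fmap handle.
        fmap = inode.fmap
        if fmap is None:
            fmap = local_cache.get(inode.fmap_handle) or disk_modules[inode.fmap_handle]

        # Steps 426-430: fetch the data page if it is not already cached locally.
        gpda = fmap.pages[offset]
        page = local_cache.get(gpda) or disk_modules[gpda]
        local_cache[gpda] = page   # cached for later requests (LRU-managed in practice)
        return page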

If, in step 416, a write operation is requested, the process proceeds, via off-page connectors 417 and 433, to steps 434-440 in FIG. 4B. Prior to any writes to disk, the disk interface module preallocates or assigns “allocation units” (AUs) of physical disk space to each host interface module. Each allocation unit consists of a 512-kilobyte segment spread evenly across a plurality of (for example, four) disks. The disk interface module sends to each host interface module, over an IPC channel 346, 348, the logical addresses of the allocation units that have been set aside for it. The host interface module then uses these logical addresses to specify the location of an object. Accordingly, the GPDA assigned by a host interface module to identify the location of a particular object specifies the “system parity group number” (SPGN), “zone” and offset within the zone where that object can be found. During initialization, the system determines the storage topology and defines a mapping associating SPGNs and specific disk interface modules. This level of mapping provides additional virtualization of the storage space, enabling greater flexibility and independence from the specific characteristics of the physical disks. The zone defines a particular region within a given SPGN; the disk interface module reserves certain zones for the data represented by each allocation unit. The disk interface module also reserves certain zones for data that is known to be frequently accessed, for example, metadata. It then allocates these zones near the center of the disk and begins allocating the rest of the space on the disk from the center outward towards the edges of the disk. Consequently, most of the disk activity is concentrated near the center, resulting in less head movement and faster disk access.
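
A minimal sketch of the addressing and allocation-unit conventions just described is given below. The helper name pages_per_disk and the example field values are assumptions made for illustration; only the decomposition of a GPDA into SPGN, zone and offset, the 512-kilobyte allocation unit and its division across four disks are taken from the description above.

    from collections import namedtuple

    # A global physical disk address as described above: the system parity group
    # number (SPGN), the zone within that group and the offset within the zone.
    GPDA = namedtuple("GPDA", ["spgn", "zone", "offset"])

    AU_SIZE = 512 * 1024   # one allocation unit: a 512-kilobyte segment
    DISKS_PER_AU = 4       # spread evenly across, for example, four disks
    PAGE_SIZE = 2048       # two-kilobyte pages, as used elsewhere in this description

    def pages_per_disk():
        # Each disk in the parity group holds an equal share of the allocation unit.
        return AU_SIZE // DISKS_PER_AU // PAGE_SIZE

    # Example: a page at offset 6144 within zone 7 of system parity group 2.
    addr = GPDA(spgn=2, zone=7, offset=6144)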

If, in step 416, it is determined that the host interface module received a write request from the client, the process proceeds, via off-page connectors 417 and 433, to step 434 where the host interface module forwards the request to the metadata module or other file system. In parallel, the software assigns the associated data, which is buffered by the I/O complex 302, to the appropriate preallocated allocation unit in the data memory 322 and sends a request to the switch through the switch interface 352 to enqueue the data for transmission, in, typically, two-kilobyte packets, to the corresponding disk interface module.

Next, in step 436, the software sends the GPDA(s) of the new location for that data over an IPC channel 346, 348 to the appropriate metadata module and broadcasts that same information over IPC channels 346, 348 to all other host interface modules in the system so that they can update their copies of the fmap in question. Alternatively, the software can broadcast invalidate messages to the other host interface modules, causing them to invalidate, rather than update, the associated fmap. Then, in step 438, the host interface module waits for acknowledgements from the disk interface module and metadata module. Acknowledgements are sent to the host interface module over the IPC channels 346, 348 from the disk interface module when the data has been received and from the metadata module when it has updated its copy of the fmap. When both acknowledgements have been received, in step 440, the CPU software signals the I/O interface 302 to send an acknowledgement to the client indicating that the data has been accepted and is secure. By “secure” it is meant that the data has been stored in two independent modules (the host interface module and a disk interface module) and the associated metadata updates either have also been stored in two modules (the host interface module and a metadata module) or a log of those updates has been stored on a second module. The process then ends in step 442.

The preallocation of allocation units has several significant advantages over the current state of the art in disk storage. In particular, the host interface module is able to respond to write requests without having to wait for disk space to be allocated for it, allowing it to implement the request immediately and to acknowledge the write much more rapidly. In effect, the preallocation gives the host interface module direct memory access to the disk. This ability to respond quickly is also enhanced by the fact that the data write does not need to wait for the metadata update to be completed.

As discussed below, each disk interface module also maintains cached copies of allocation units. When a cached copy of an allocation unit in a disk interface module has been filled and written to disk, the disk interface module releases the cached copy, preallocating a new allocation unit, both on disk and in its cache, and sending the host interface module a message to that effect over the IPC 346, 348. The disk interface module can then reuse the cache memory locations previously occupied by the released allocation unit. At any given time, each host interface module has several allocation units preallocated for it by each disk interface module. Host interface modules select which allocation unit to use for a given write based solely on the recent activity of the associated disk interface module. This enables the workload to be distributed evenly across all disks, providing the system with the full disk bandwidth and avoiding the serious performance limitations that are frequently encountered in standard storage systems when multiple hosts attempt to access the same disk at the same time.
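
One way a host interface module might choose among its preallocated allocation units, using only the recent activity of the associated disk interface modules as described above, is sketched below; the dictionary shapes and the activity metric are illustrative assumptions rather than part of the embodiment.

    def choose_allocation_unit(preallocated, recent_activity):
        """Pick the preallocated allocation unit whose disk interface module has
        been least active recently, so writes spread evenly across all disks.

        preallocated    -- dict: disk interface module id -> list of free AU ids
        recent_activity -- dict: disk interface module id -> recent write count
        """
        candidates = [m for m, aus in preallocated.items() if aus]
        if not candidates:
            raise RuntimeError("no preallocated allocation units available")
        quietest = min(candidates, key=lambda m: recent_activity.get(m, 0))
        return quietest, preallocated[quietest][0]

    # Example: module 2 has been least active, so its first free AU is used next.
    target, au = choose_allocation_unit(
        {1: ["au-17"], 2: ["au-42"], 3: []},
        {1: 120, 2: 35, 3: 80},
    )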

In accordance with one aspect of the invention, the data assigned to a given allocation unit may come from multiple hosts and multiple files or even file systems. The only relationship among the various data items comprising an allocation unit is temporal; they all happened to be written in the same time interval. In many cases, this can offer a significant performance advantage since files, or portions of files, that are accessed in close time proximity tend to be re-accessed also in close time proximity. Thus, when one such file is accessed, the others will tend to be fetched from disk at the same time, obviating the need for subsequent disk accesses. On the other hand, this same process obviously gives rise to potential fragmentation, with the data associated with a given file ending up stored in multiple locations on multiple disks. Procedures to mitigate the possible deleterious effect of fragmentation when those files are read are discussed below. This technique for allowing data to be stored anywhere, without regard to its content or to the location of its prior incarnation, allows for superior, scalable performance.

As in any storage system, it is necessary to identify disk sectors that contain data that is no longer of interest, either because the file in question has been deleted or because it has been written to another location. This is accomplished in the current invention by maintaining a reference count for each page stored on disk. When a page is written to a new location, new fmap entries must be created to point to the data as described in the preceding paragraphs. Until the pages containing the old fmap entries have been deleted, other pages pointed to by other entries on those same pages will now have an additional entry pointing to them. Accordingly, their reference counts must be incremented. When an fmap page is no longer needed (i.e., when no higher-level fmap points to it), it can be deleted and the reference counts of the pages pointed to by entries in the deleted fmap page must be decremented. Any page having a reference count of zero then becomes a “free” page and the corresponding disk locations can be reused.
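
The reference-counting scheme described above can be illustrated with the following sketch; the class name and the use of a print statement to mark freed pages are illustrative only.

    class ReferenceCounts:
        """Per-page reference counts used to find free disk pages, as described above."""

        def __init__(self):
            self.counts = {}          # GPDA -> number of fmap entries pointing at it

        def add_reference(self, gpda):
            self.counts[gpda] = self.counts.get(gpda, 0) + 1

        def drop_reference(self, gpda):
            self.counts[gpda] -= 1
            if self.counts[gpda] == 0:
                del self.counts[gpda]
                self.mark_free(gpda)  # the disk location can now be reused

        def mark_free(self, gpda):
            print(f"page {gpda} is free and may be reused")

    # When an fmap page is deleted, every page it pointed to loses one reference.
    refs = ReferenceCounts()
    refs.add_reference("spg0/zone3/off4096")
    refs.drop_reference("spg0/zone3/off4096")   # count reaches zero -> page is free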

This procedure allows volumes to be copied virtually instantaneously. Volume copies are commonly used to capture the state of a file system at a given instant. To effect a volume copy, it is only necessary to define a new volume with pointers to the fmaps of the files that are being copied. When a page in a copied file is to be modified, the new fmap entries point to the new location while the old fmap entries point to the original, static, version of the file. As a result, unmodified pages are now pointed to by more than one fmap entry and their reference counts are incremented accordingly to prevent their being deleted as long as any copy of the volume is still of interest.

FIG. 5 illustrates in more detail the construction of a disk interface module 500. All disk interface modules are of identical construction and are interchangeable. The construction of each disk interface module is similar to the construction of a host interface module shown in FIG. 3 and similar parts have been given corresponding numeral designations. These parts operate in a fashion identical with their corresponding counterparts in FIG. 3. For example, data memory 322 corresponds to data memory 522. The major differences between the two modules lie in the I/O complex 502 and in the data complex 518. The I/O complex 302 in each host interface module is replaced in each disk interface module with a complex 502 consisting of five one-gigabit, full-duplex Fibre Channel interfaces (504-515), each containing the logic needed to send data to and to retrieve data from disk drives from various manufacturers over five Fibre channels 507-517. These Fibre Channel interfaces 504-515 are used to communicate with sets of five disks, each channel supporting up to 90 disks, enabling each disk interface module 500 to control up to 450 disks. Parity information is stored along with the data, so twenty percent of the disk space is used for that purpose. However, each disk interface module can still manage nearly 33 terabytes of data using 73-gigabyte disks.

All disks connected to a disk interface module are dual-ported, with the second port connected to a second disk interface module. In normal operation, half the disks connected to any given disk interface module are controlled solely by it. The disk interface module assumes control over the remaining disks only in case of a fault in the other disk interface module that has access to the disks. This prevents the loss of any data due to the failure of a single disk interface module.

The data complex 518 in each disk interface module is identical to the data complex 318 in a host interface module 300 except for the addition of a special hardware unit 521 that is dedicated to calculating the parity needed for protection of the integrity of data stored on disk. During a disk write operation, the memory controller 520 successively transfers each of a set of four blocks of data that is to be written to disk to the parity generator 521, which generates the exclusive-or of each bit in the first of these blocks with the corresponding bit in the second block, the exclusive-or of these bits with the corresponding bits in the third block and the exclusive-or of these bits with their counterparts in the fourth block. This resulting exclusive-or block is then stored in memory 522 to be transferred, along with the data, to the five disks over five independent channels 507-515. The size of the blocks is referred to as the “stripe factor” and can be set according to the application. The specific disk used to store the parity block is a function of the allocation unit being stored. This allows the parity blocks to be spread evenly across the five disk channels 507-515.
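
The parity computation performed by the parity generator 521, and the reconstruction it makes possible, amount to a bitwise exclusive-or across the four data blocks of a stripe. The following sketch illustrates this in Python; it is a functional model only and does not reflect the hardware implementation.

    def parity_block(blocks):
        """Exclusive-or of four equal-sized blocks, as the parity generator 521 produces."""
        assert len(blocks) == 4 and len({len(b) for b in blocks}) == 1
        parity = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                parity[i] ^= byte
        return bytes(parity)

    def rebuild_block(surviving_blocks, parity):
        """Reconstruct a lost block from the three surviving blocks and the parity block."""
        return parity_block(list(surviving_blocks) + [parity])

    # Example with a 4-byte "stripe factor": losing any one block is recoverable.
    data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40",
            b"\xaa\xbb\xcc\xdd", b"\x00\xff\x00\xff"]
    p = parity_block(data)
    assert rebuild_block(data[1:], p) == data[0]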

The steps taken by the disk interface module CPU 532 in response to read and write requests are illustrated in the flowchart in FIG. 6. The process begins in step 600 and proceeds to step 602 in which a new event is received by the disk interface module. On detecting that a new event has occurred, i.e., that either data has been received over the switch or a request has been received over the IPC, the CPU software in the target disk interface module determines the appropriate action in step 604. If the event is a read request, the process proceeds to step 610 in which the CPU software checks the disk interface module CAM 538, using the GPDA provided in the request as a key, to determine if the desired object is cached in its local data memory 522. If the data is cached, the process proceeds to step 616, described below.

Alternatively, if in step 610 it is determined that the data is not cached, the process proceeds to step 612 in which the software sends a request to the I/O complex 502 directing that the requested data be read, along with, typically, several adjacent disk sectors in anticipation of subsequent reads, and stored in an assigned location in the data memory 522. The number of additional pages to be read is specified in the read request generated in the requesting host interface module; this number is determined from an examination of the type of file being read and other information gleaned by the host interface module from the attributes associated with the file. The additional pages are cached in case they are subsequently needed and overwritten if they are not.

The CPU software then polls the I/O complex 502 to determine when the read is complete, as illustrated in step 614. When the data is located in the data memory 522, either through a cache hit or by being transferred in from disk, in step 616, the software sends a message to the switch interface 552, thereby enqueuing the data for transmission to the requesting host interface module. While any data item cached in the disk interface module data memory 522 as the result of a write must also be cached in some host interface module data memory, the data item is not necessarily cached in the memory of the host interface module making the read request. Similarly, data may be cached in a disk interface module due to a prior read from some host interface module other than the one making the current request.

If, in step 604, it is determined that a write request has been received, the process proceeds to step 606. Allocation units that have been preallocated to host interface modules are represented by reserved cache locations in the disk interface module local data memory 522 and by “free space” on disk, that is, by sectors that no longer store any data of interest. When the switch interface 552 receives write data from a host interface module, it stores the data directly in the preassigned allocation unit space in its data memory 522 and enqueues a message for the CPU software that the data has been received. Upon receiving the message, the software sends an acknowledgement over the IPC links 546, 548 to the appropriate host interface module as shown in step 606 and enqueues the data for transfer to disk storage, via the I/O complex 502, as shown in step 608. The process then terminates in step 618.

When disk bandwidth is available, or when space is needed to accommodate new data, the CPU software instructs the I/O complex 502 to transfer to disk storage the contents of one allocation unit cached in the data memory 522. If possible, the CPU software selects an allocation unit that is already full to store to disk and, of the full allocation units, it selects one that is “relatively inactive.” One typical method for performing this selection is to select the allocation unit that has been least recently accessed (according to one of several well-known least-recently-used algorithms), but other criteria could also be used. For example, one of the preallocated allocation units may be selected at random. Alternatively, the switch unit could keep track of the length of the queues of transactions awaiting access to the various disk modules. This information could then be communicated back to the host interface module and used to make a decision as to which allocation unit to select based on actual disk activity.
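
The selection of an allocation unit to flush, preferring a full and relatively inactive unit as described above, might be modeled as follows; the dictionary layout and the timestamp field are illustrative assumptions.

    def select_au_to_flush(cached_aus):
        """Choose which cached allocation unit the disk interface module writes to disk.

        cached_aus -- list of dicts with 'id', 'full' (bool) and 'last_access' (timestamp).
        Prefers a full allocation unit; among those, the least recently accessed one.
        """
        full = [au for au in cached_aus if au["full"]]
        pool = full if full else cached_aus
        return min(pool, key=lambda au: au["last_access"])["id"]

    # Example: au-7 is full and has been idle longest, so it is flushed first.
    print(select_au_to_flush([
        {"id": "au-3", "full": False, "last_access": 50},
        {"id": "au-7", "full": True,  "last_access": 10},
        {"id": "au-9", "full": True,  "last_access": 40},
    ]))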

As previously noted, the parity-generation hardware 521 is used to form the parity blocks that are stored along with the data; therefore, in order to store one allocation unit, 128 kilobytes of data and parity information are sent over each of the five channels 507-515 to five different disks. Once the data has been stored on disk, the software sends a message to the relevant host interface module over the IPC links 546, 548 informing the host interface module that the contents of the allocation unit can now be released. However, those contents are not overwritten in either the host interface module or the disk interface module until the space is actually needed, thereby allowing for the possibility that the information might be requested again before it is expunged and hence can be retrieved without having to access physical disk storage.

Note that since an allocation unit is written as a unit, the parity information stored along with each disk stripe never has to be read and updated, reducing by three-fourths the number of disk accesses that would otherwise be needed to store a single page. For example, in prior art systems, the prior contents of the page to be stored have to be read, the parity page has to be read and modified based on the change between the new and old contents of the page in question, and the new page and the parity page both have to be stored. In the inventive system, the only time a parity page normally has to be read is when the data on some sector fails the standard cyclic residue code (CRC) check always used to protect data stored on disk. In this event, the parity sector is read and, in combination with the three error-free sectors, is used to reconstruct the contents of the defective sector.

As previously noted, the policy of writing data to arbitrary locations, while offering major performance advantages, can result in fragmentation of files that are only partially updated. Since each host interface module can use any allocation unit at its disposal, and, in fact, selects allocation units solely on the basis of the recent activity of the associated disks, files may well be split up among multiple disk interface modules. This tendency toward fragmentation is mitigated by a write-back policy. That is, when a host interface module reads a file that has been fragmented, it follows that read with a write, placing all the file fragments, or all that will fit, in the same allocation unit. The previously described technique for ensuring that newly written data and metadata are consistent is, of course, used with write-back operations as well.

Another potential inefficiency resulting from the “write anywhere” policy is that sections of allocation units are gradually replaced by more up-to-date versions written elsewhere, leaving holes in those allocation units that represent wasted disk space unless they are identified and reused. Since the reference count technique described earlier allows those sections to be identified, they can, in fact, be reused. To make their reuse more efficient, the software running on the disk interface module's CPU 532, as a background task, identifies those allocation units having more than a predetermined percentage of unused space and sends the GPDAs of the still-valid sectors to a host interface module so that the sectors can be read and rewritten more compactly.
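
The background identification of sparsely used allocation units might be modeled as in the sketch below; the threshold value and the data layout are illustrative assumptions.

    def find_sparse_allocation_units(allocation_units, threshold=0.5):
        """Background scan for allocation units with too much unused space.

        allocation_units -- dict: AU id -> (valid_bytes, total_bytes)
        Returns the AU ids whose unused fraction exceeds the threshold; the GPDAs of
        their still-valid sectors would then be sent to a host interface module to be
        read and rewritten more compactly.
        """
        sparse = []
        for au_id, (valid, total) in allocation_units.items():
            if (total - valid) / total > threshold:
                sparse.append(au_id)
        return sparse

    # Example: au-2 is mostly holes, so it is a candidate for compaction.
    print(find_sparse_allocation_units({"au-1": (500_000, 524_288),
                                        "au-2": (100_000, 524_288)}))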

The detailed construction of a metadata module 700 is shown in FIG. 7. The metadata module 700 differs from the host interface module 300 and disk interface module 500 in two basic ways: the metadata module 700 has no I/O complex since it does not communicate with either clients or disks; and the data complexes present in the host interface modules and disk interface modules are eliminated and their large data memories replaced by a relatively small memory 754 that serves as a store-and-forward buffer. Data destined to be sent through the switch output 756 and connections 760 is first transferred, using a DMA engine 753, from the CPU's local memory 726 into the buffer memory 754 before being enqueued for transfer. Similarly, data received over the switch via connections 762 and switch input 758 is transferred from the input buffer 754 directly into preassigned locations in local memory 726.

Since the local memory 726 in the metadata module 700 stores all data received over the switch, it is considerably larger than its counterparts in the host interface module 300 and disk interface module 500, normally being comparable in size to the latter modules' data memories 322 and 522, respectively. The local memory 726 is used primarily for caching inodes and fmaps. The other elements shown in FIG. 7 are similar in function and implementation to the corresponding elements shown in FIGS. 3 and 5.

The purpose of the metadata module 700 is to maintain the file system structure, to keep all inodes consistent and to forward current inodes to host interface modules that request them. When a new file or directory is created, it is the responsibility of the metadata module to generate the associated inode and to insert a pointer to it into a B-tree data structure used to map between inodes and GPDAs. Similarly, when a file or directory is deleted, the metadata module must delete its associated inode, as well as those of all its descendants, and modify the B-tree data structure accordingly.

When a host interface module receives a request from a client that requires inode information that the host interface module cannot find in its own local memory, it uses the IPC links to query the metadata module associated with the file system in question. The steps taken by the software running on a metadata module CPU 732 to service a typical request are depicted in FIGS. 8A and 8B.

In FIG. 8A, the process begins in step 800 and proceeds to step 802 where a request is received by the metadata module. All requests to a metadata module for an object are accompanied by a handle that includes an “inode number” uniquely identifying the object, or the parent of the object, being requested. These unique inode numbers are assigned by the file system to each of its files and directories. The handle used by a client to access a given file or directory includes the inode number, which is needed to locate the object's associated inode. In step 804, the metadata module checks its CAM 738 using that inode number as a key. If the inode information is in the CAM 738, the process proceeds, via off-page connectors 815 and 819, to step 816, discussed below.

If the inode information is not in the local memory 726, as indicated by a cache “miss,” the CPU software then searches through an inode B-tree data structure in memory 726 to find the GPDA of the inode data, as indicated in step 806. If the necessary B-tree pages are not present in local memory, the process proceeds to step 808 where the software sends a message over IPC links 746, 748 to the appropriate disk interface module requesting that a missing page be returned to it over the switch. The metadata module 700 then waits for a response from the disk interface module (step 810).

In step 812, the CPU software examines either the cached data from step 806 or the data returned from the request to the disk interface module in step 810 to determine if the data represents a leaf page. If not, the process returns to step 808 to retrieve additional inode information. If the data does represent a leaf page, then the process proceeds, via off-page connectors 813 and 817, to step 814. Once the metadata module 700 has located the GPDA of the inode itself (the desired leaf entry), in step 814, the metadata module 700 sends a request over the IPC links 746, 748 for the page containing that inode. At this point, the inode information has been obtained either from the CAM 738 in step 804 or by retrieving the information in step 814.
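
The B-tree walk of FIG. 8A (steps 806 through 814) can be illustrated by the following sketch, in which pages are represented as dictionaries and the page_cache and disk_modules mappings stand in for local memory 726 and for pages returned over the switch; the page layout shown is an assumption made for illustration.

    def locate_inode(inode_number, root_page, page_cache, disk_modules):
        """Walk the inode B-tree to find the GPDA of an inode (FIG. 8A, steps 806-814).

        Interior pages carry 'children': a list of (max_key, page_gpda) pairs.
        Leaf pages carry 'entries': a dict of inode_number -> inode GPDA.
        """
        page = root_page
        while "entries" not in page:                   # not yet a leaf page
            for max_key, child_gpda in page["children"]:
                if inode_number <= max_key:
                    break
            # Step 808: use the cached page if present, otherwise fetch it from a disk module.
            page = page_cache.get(child_gpda) or disk_modules[child_gpda]
        return page["entries"][inode_number]           # GPDA of the inode itself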

The process then proceeds to step 816 where a determination is made concerning the request. If the request received from the host interface module was to return the handle associated with a named object in a directory having a given inode number, the retrieved inode is that of the directory and the process proceeds to step 820.

To fulfill the request, the metadata module 700 must read the directory itself, as shown in step 820. The CPU software first queries its CAM 738, using the directory's GPDA as a key, to determine if the directory information is cached in its local memory 726. If the desired information is present, then the process proceeds to step 818. If the directory, or the relevant portion of the directory, is not cached, the software must again send a message over the IPC to the disk interface module storing the directory, requesting that the directory information be returned to it through the switch, as indicated in step 822. Once it has access to a directory page, it searches the page to find the desired object. If the object is not found in step 824, the process returns to step 820 to obtain a new page. Eventually it locates the named object and its associated inode number.

Finally, once the metadata module has located either the inode of the object specified by the handle or the inode of the named object, depending on the specific request, it forwards the requested information on to the requesting host interface module as set forth in step 818. The process then ends in step 826.

A detailed diagram of the switch module is shown in FIG. 9. The switch module 900 is composed of three major components: a crossbar switch complex 906 providing non-blocking, full-duplex data paths between arbitrary pairs of host interface modules, disk interface modules and metadata modules; an IPC complex 904 composed of switches 942 for two sets of full-duplex, serial, 10/100 Ethernet channels 938 and 940 that provide messaging paths between arbitrary pairs of modules; and a configuration management complex 902 including system reset logic 924 and the system clock 908. The switch module is implemented as a redundant pair for reliability and availability purposes; however, only one of the pair is shown in FIG. 9 for clarity.

The I/O processor 954 in the crossbar switch complex 906 accepts requests from the switch interfaces 356, 556 and 756 on the host interface modules, disk interface modules and metadata modules, respectively, over the request links and grants access over the grant links. Each module can have one request outstanding for every other module in the system or for any subset of those modules. During each switch cycle, the arbiter 950 pairs requesting modules with destination modules. The arbiter assigns weights to each requester and to each destination. These weights can be based on any of several criteria, e.g., the number of requests a requester or destination has in its queue, the priority associated with a submitted request, etc. The arbiter then sequentially assigns the highest-weight unpaired destination to the unpaired requester having the highest weight among those requesting it. It continues this operation as long as any unpaired requester is requesting any, as yet, unpaired destination.
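
The arbiter's pairing pass can be modeled, for illustration only, by the following greedy sketch: during each cycle the highest-weight unpaired destination that is still being requested is given to the highest-weight unpaired requester asking for it, until no unpaired requester wants an unpaired destination. The dictionary shapes and weight values are assumptions.

    def arbitrate(requests, requester_weight, destination_weight):
        """Sketch of the arbiter 950's pairing pass for one switch cycle.

        requests           -- dict: requester -> set of destinations it wants to reach
        requester_weight   -- dict: requester -> weight (e.g. queue length, priority)
        destination_weight -- dict: destination -> weight
        """
        pairs = {}
        free_requesters = set(requests)
        free_destinations = set(destination_weight)
        while True:
            candidates = [(d, r) for d in free_destinations
                          for r in free_requesters if d in requests[r]]
            if not candidates:
                return pairs
            dest, req = max(candidates,
                            key=lambda dr: (destination_weight[dr[0]],
                                            requester_weight[dr[1]]))
            pairs[req] = dest
            free_requesters.discard(req)
            free_destinations.discard(dest)

    # Example: two modules contending for the same destination; the heavier one wins.
    print(arbitrate({"host1": {"disk1"}, "host2": {"disk1", "disk2"}},
                    {"host1": 3, "host2": 1},
                    {"disk1": 5, "disk2": 2}))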

The I/O processor 954 then sends each requesting module, over the appropriate grant link, the identity of the module with which it has been paired and to which it can send a data packet during the next switch cycle. The arbiter 950 sets the crossbar switch 952 to the appropriate state to effect those connections.

The switch 952 itself consists of four sets of multiplexers, one multiplexer from each set for each destination, with each multiplexer having one input from each source. Switch cycles are roughly four microseconds in duration, during which time two kilobytes of data are sent between each connected pair, with a resulting net transfer rate of approximately 500 megabytes/second per connected pair.

The function of the IPC switches 942 is to connect source and destination IPC ports 944 long enough to complete a given transfer. The standard IEEE 802.3 SNAP (sub-network access protocol) communication protocol is used, consisting of a 22-byte SNAP header followed by a 21-byte message header, a data packet of up to 512 bytes and a 32-bit cyclic residue code (CRC) to protect against transmission errors.
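
The IPC message format described above, a 22-byte SNAP header, a 21-byte message header, up to 512 bytes of data and a 32-bit check code, can be illustrated as follows. The use of the CRC-32 polynomial from zlib is an assumption made only so that the sketch is executable; the embodiment specifies only a 32-bit cyclic residue code.

    import zlib

    SNAP_HEADER_LEN = 22    # IEEE 802.3 SNAP header
    MSG_HEADER_LEN = 21     # message header described above
    MAX_PAYLOAD = 512       # data packet of up to 512 bytes

    def build_ipc_frame(snap_header, msg_header, payload):
        """Assemble an IPC message as described above; field contents are illustrative."""
        assert len(snap_header) == SNAP_HEADER_LEN
        assert len(msg_header) == MSG_HEADER_LEN
        assert len(payload) <= MAX_PAYLOAD
        body = snap_header + msg_header + payload
        crc = zlib.crc32(body).to_bytes(4, "big")   # 32-bit check code protects the frame
        return body + crc

    frame = build_ipc_frame(b"\x00" * 22, b"\x00" * 21, b"hello")
    assert len(frame) == 22 + 21 + 5 + 4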

The configuration management complex 902 coordinates system boot and system reconfiguration following faults. To support the first of these activities, it implements two external communications links: one 936 giving access through the PCI bus 928 via a full-duplex, serial, 10/100 Ethernet channel 932; and the other 912 giving RS-232 access 914 through the peripheral bus 922. To support the second activity, it implements the reset logic 924 for the entire system. It also implements and distributes the system clock 908.

The disclosed invention has several significant fault-tolerant features. By virtue of the fact that it is implemented with multiple copies of identical module types and that all of these modules have equal connectivity to all other modules, it can survive the failure of one or more of these modules by transferring the workload previously handled by any failed module to other modules of the same type. The switch fabric itself, of course, is a potential single point of failure since all inter-module communication must pass through it. However, as mentioned in the previous section, the switch in the preferred implementation is implemented with two identical halves. During the initialization process, the two configuration management complexes 902 communicate with each other, via the IPC channels, to determine if both are functioning properly and to establish which will assume the active role and which the standby role. If both switch halves pass their self-diagnostic tests, both sets of IPC channels are used and the configuration management complexes 902 cooperate in controlling the system configuration and monitoring its health. Each switch half, however, supports the full data bandwidth between all pairs of modules; therefore, only the active half of the switch is used for this purpose. If one switch half becomes inoperative due to a subsequent failure, the configuration management complexes cooperate to identify the faulty half and, if it is the half on which the active configuration manager resides, transfer that role to the former standby half. The surviving configuration manager communicates the conclusion to the other system modules. These modules then use only the functioning half of the switch for all further communication until notified by the configuration manager that both halves are again functional. Although the IPC bandwidth is halved when only one switch half is operational, the full data bandwidth and all other capabilities are retained even under these circumstances.

Several complementary methods are used to identify faulty modules, including (1) watchdog timers to monitor the elapsed time between the transfer of data to a module and the acknowledgement of that transfer and (2) parity bits used to protect data while it is being stored in memory or transferred from one point to another. Any timeout or parity violation triggers a diagnostic program in the affected module or modules. If the violation occurred in the transfer of data between modules, the fault could be in the transmitting module, the receiving module or in the switch module connecting the two, so the diagnostic routine involves all three modules checking both themselves and their ability to communicate with each other. Even if the diagnostic program does not detect a permanent fault, the event is logged as a transient. If transient events recur with a frequency exceeding a settable parameter, the module involved in the greatest number of such events is taken off line and the failure treated as permanent, thereby triggering manual intervention and repair. If transients continue, other modules will also be taken off line as a consequence until the fault is isolated.

Byte parity is typically used on data stored in memory and various well-known forms of vertical parity checks and cyclic-residue codes are used to protect data during transfer. In addition, in the storage system embodiment described here, data tags consisting of 32-bit vertical parity check information on each data page are stored on disk separately from the data being protected. When data is retrieved from disk, the tag is also retrieved and appended to the data. The tag is then checked at the destination and any discrepancy flagged. This provides protection not only from transmission errors but also from disk errors that result in reading the wrong data (or the wrong tag). This latter class of errors can result, for example, from an addressing error in which the wrong sector is read from disk or from a write current failure in which old data is not overwritten.

Another important fault-tolerant feature of the storage system embodiment of the invention is the requirement that all data and metadata be stored on at least two different modules or on parity-protected disk before the receipt of any data is acknowledged. This guarantees that the acknowledged data will still be available following the failure of any single module. Similarly, data stored on disk is protected against any single disk failure, and against any single disk channel failure, by guaranteeing that each data block protected by a parity block is stored on a different physical disk, and over a different disk channel, from all other blocks protected by the same parity block and from the disk storing the parity block itself.

Finally, the fact that all disks are dual-ported to two different disk interface modules guarantees that data can still be retrieved should any one of those disk interface modules fail. Following such an event and the resulting reconfiguration, all subsequent accesses to data stored on the affected disks are routed through the surviving disk interface module. While this may result in congestion because the surviving disk interface module is now servicing twice as many disks, full accessibility is retained. In addition, the previously described load-balancing capability of the system will immediately begin redistributing the workload to alleviate that congestion.
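
A minimal sketch of this dual-porting failover follows; the routing table and function name are assumptions made for illustration only.

    # Sketch: route a disk access through the surviving disk interface module.
    def route_disk_access(disk_id, ports, failed_modules):
        """ports maps disk_id -> (primary_module, secondary_module)."""
        primary, secondary = ports[disk_id]
        if primary not in failed_modules:
            return primary
        if secondary not in failed_modules:
            return secondary   # survivor now services twice as many disks
        raise RuntimeError("both disk interface modules for this disk are down")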

Similar protection against host interface module failures can be achieved by connecting clients to more than one host interface module. Since all host interface modules have full access to all system resources, any client can access any resource through any host interface module. Full connectivity is retained as long as a client is connected to at least one functioning host interface module, and connection to more than one host interface module provides not only protection against faults but also increased bandwidth into the system.
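
The sketch below illustrates, under assumed names, how a client attached to several host interface modules could spread requests across them for extra bandwidth and fall back to whichever modules remain reachable; the patent does not prescribe any particular client-side policy.

    # Illustrative client-side multipathing across host interface modules.
    import itertools

    class ClientPaths:
        def __init__(self, host_interface_modules):
            self.paths = list(host_interface_modules)
            self._rr = itertools.cycle(self.paths)

        def next_path(self, down=frozenset()):
            """Round-robin across functioning host interface modules."""
            for _ in range(len(self.paths)):
                p = next(self._rr)
                if p not in down:
                    return p
            raise RuntimeError("no functioning host interface module reachable")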

The architecture described in the previous paragraphs exhibits several significant advantages over current state-of-the-art storage system architectures:

1) It is highly scaleable. Host interface modules, disk interface modules and metadata modules can all be added independently as needed, and their numbers can be independently increased as storage throughput or capacity demands increase. A system using a 16-port crossbar switch, for instance, can support any combination of host interface modules, disk interface modules and metadata modules up to a total of 16. This would allow a system to be configured, for example, to give 32 directly connected clients access to over 40 terabytes of data (using 36-gigabyte disks) supported by two metadata modules. Obviously, even larger configurations can be realized with larger IPC and wider crossbar switches.

2) Since writes can be directed to arbitrary disk interface modules, demand can be equalized across all disk resources, ensuring that throughput will increase nearly linearly with the number of disk interface modules in the system. Further, writes can take place in parallel with fmap updates, thereby decreasing the latency between the initiation of a data write and the acknowledgement that it has been accepted. Since both the data and the metadata associated with a new write are always stored in two independent places before that write is acknowledged, write acknowledgements can be issued before data is actually stored on disk while still guaranteeing that the data is secure. (A sketch of this write path appears after this list.)

3) Relegating metadata operations to modules designed for that purpose not only enables faster metadata processing but, in addition, allows the host interface and disk interface modules to be structured as efficient data pipes, with the bulk of local memory partitioned as a bi-directional buffer. Since the client's communication protocol is terminated in the host interface module I/O complex, the bulk of data passing through this data memory does not need to be examined by the host interface module CPU software.
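
The sketch referenced in item 2 follows. It is an illustration under assumed names (prealloc, send_data, update_fmap), not the patented implementation: the host interface module picks a preallocated logical address from the least-busy disk interface module, sends the data there, updates the fmap, and acknowledges the client only once both the data and the metadata exist in two independent places. The data write and the fmap update are shown sequentially here for simplicity; the system performs them in parallel.

    # Illustrative write path: preallocation-based placement plus early
    # acknowledgement once data and metadata are each doubly stored.
    def choose_preallocated_address(prealloc):
        """prealloc maps module_id -> (pending_ops, free_logical_addresses).
        Pick the least-loaded disk interface module with space available."""
        candidates = [(load, mid) for mid, (load, addrs) in prealloc.items() if addrs]
        if not candidates:
            raise RuntimeError("no preallocated logical addresses available")
        _, module_id = min(candidates)
        return module_id, prealloc[module_id][1].pop()

    def write_object(obj_id, data, prealloc, send_data, update_fmap):
        """send_data and update_fmap are callbacks supplied by the surrounding
        system; each returns only after two independent copies exist."""
        module_id, logical_addr = choose_preallocated_address(prealloc)
        send_data(module_id, logical_addr, data)   # data held in two modules
        update_fmap(obj_id, logical_addr)          # metadata held in two modules
        return "ack"                               # safe to acknowledge now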

Although an exemplary embodiment of the invention has been disclosed, it will be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the spirit and scope of the invention. For example, it will be obvious to those reasonably skilled in the art that, although the description was directed to particular embodiments of host interface modules, disk interface modules, metadata modules and switch modules, other designs could be used in the same manner as that described. Other aspects, such as the specific circuitry utilized to achieve a particular function, as well as other modifications to the inventive concept, are intended to be covered by the appended claims.

CLAIMS

1. Apparatus for providing high-performance, scalable data storage services from a plurality of storage devices to a client in response to data storage requests, each data storage request including a data object identifier that identifies a data object to be stored, the apparatus comprising: a plurality of storage interface modules, each of which stores data directed to logical addresses into physical locations in each of the storage devices; and a host interface module that receives data storage requests and, in response to each storage request and based on relative activities of the storage interface modules, dynamically selects logical addresses to which a data object identified by that request is stored, so that data storage activity will be dynamically distributed across the plurality of storage devices.
2. The apparatus of claim 1 wherein each of the storage interface modules comprises a mechanism that generates preallocation logical addresses which identify unused space in the plurality of storage devices.
3. The apparatus of claim 2 wherein each of the storage interface modules comprises a communication module that transmits preallocation logical addresses to the host interface module.
4. The apparatus of claim 3 wherein the host interface module comprises a selection mechanism that, in response to each data storage request, selects, from preallocation logical addresses transmitted to the host interface module, logical addresses to which a data object identified by that request is stored.
5. The apparatus of claim 1 wherein the host interface module comprises a local memory containing logical addresses identifying a storage location of each stored data object.
6. The apparatus of claim 5 further comprising a metadata module having a metadata memory containing logical addresses identifying a storage location of each stored data object.
7. The apparatus of claim 6 wherein the host interface module comprises a communication module that, after logical addresses to which a data object identified by a data storage request is stored have been selected, sends the selected logical addresses to the metadata module in order to update the metadata memory in the metadata module to represent a current storage location of that data object.
8. The apparatus of claim 7 wherein the host interface module, in response to a data retrieval request received from the client and including a data object identifier, accesses the local memory in order to determine a current storage location of that data object.
9. The apparatus of claim 8 wherein the host interface module comprises a mechanism that accesses the metadata memory when a current storage location of that data object cannot be determined from the local memory.
10. The apparatus of claim 1 further comprising a mechanism operable on initialization of the apparatus and upon the topology of the storage devices for mapping logical addresses to physical locations in each of the storage devices.
11. A method for providing high-performance, scalable data storage services from a plurality of storage devices to a client in response to data storage requests, each data storage request including a data object identifier that identifies a data object to be stored, the method comprising: (a) providing a plurality of storage interface modules, each of which stores data directed to logical addresses into physical locations in each of the storage devices; and (b) providing a host interface module that receives data storage requests and, in response to each storage request and based on relative activities of the storage interface modules, dynamically selects logical addresses to which a data object identified by that request is stored, so that data storage activity will be dynamically distributed across the plurality of storage devices.
12. The method of claim 11 wherein step (a) comprises, in each storage interface module, generating preallocation logical addresses which identify unused space in the plurality of storage devices.
13. The method of claim 12 wherein step (a) comprises, in each storage interface module, transmitting preallocation logical addresses to the host interface module.
14. The method of claim 13 wherein step (b) comprises, in response to each data storage request, selecting, from preallocation logical addresses transmitted to the host interface module, logical addresses to which a data object identified by that request is stored.
15. The method of claim 11 further comprising providing the host interface module with a local memory containing logical addresses identifying a storage location of each stored data object.
16. The method of claim 15 further comprising providing a metadata module having a metadata memory containing logical addresses identifying a storage location of each stored data object.
17. The method of claim 16 wherein step (b) comprises, after logical addresses to which a data object identified by a data storage request is stored have been selected, sending the selected logical addresses to the metadata module in order to update the metadata memory in the metadata module to represent a current storage location of that data object.
18. The method of claim 17 further comprising, in response to a data retrieval request received from the client and including a data object identifier, accessing the local memory in order to determine a current storage location of that data object.
19. The method of claim 18 further comprising accessing the metadata memory when a current storage location of that data object cannot be determined from the local memory.
20. The method of claim 11 further comprising, upon initialization of the storage interface modules, mapping logical addresses to physical locations in each of the storage devices based on the topology of the storage devices.
21. Apparatus for providing high-performance, scalable data storage services from a plurality of disks to a client in response to data storage requests, each data storage request including a data object identifier that identifies a data object to be stored, the apparatus comprising: a plurality of storage interface modules, each of which stores data directed to logical addresses into physical locations in each of the disks; and a host interface module that receives data storage requests and, in response to each storage request and based on relative activities of the storage interface modules, dynamically selects logical addresses to which a data object identified by that request is stored, so that data storage activity will be dynamically distributed across the plurality of disks.